GPQA Diamond

A graduate-level, Google-proof science benchmark where PhD experts reach only 65% — and frontier AI models now surpass them

Published: August 20, 2025

Keywords: GPQA Diamond, AI benchmark, graduate-level science QA, Google-proof questions, PhD-level evaluation, frontier LLM benchmark, physics chemistry biology, expert-level reasoning, COLM 2024, NYU benchmark

Introduction

Most AI benchmarks, even challenging ones like MMLU, have been saturated by frontier models, with leading systems scoring over 90%. This leaves them of little use for distinguishing between state-of-the-art models or for measuring genuine scientific reasoning.

GPQA Diamond is different. It is the hardest, most vetted subset of GPQA, the Graduate-Level Google-Proof Q&A benchmark: 198 multiple-choice questions in biology, physics, and chemistry so difficult that PhD-level domain experts reach only 65% accuracy. Non-expert validators, given over 30 minutes per question and full internet access, score only 34% — making these questions truly “Google-proof.”

“We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: even highly skilled non-expert validators only reach 34% accuracy, despite spending over 30 minutes with unrestricted access to the web.” — GPQA Paper

graph LR
    A["Traditional Benchmarks<br/>(MMLU, etc.)<br/>90%+ accuracy"] --> B["Benchmark<br/>Saturation"]
    B --> C["GPQA Diamond<br/>198 PhD-level questions<br/>Experts: 65%"]
    C --> D["Meaningful signal<br/>for frontier AI<br/>reasoning"]

    style A fill:#e74c3c,stroke:#333,color:#fff
    style B fill:#f39c12,stroke:#333,color:#fff
    style C fill:#27ae60,stroke:#333,color:#fff
    style D fill:#3498db,stroke:#333,color:#fff

What Is GPQA Diamond?

GPQA (Graduate-Level Google-Proof Q&A) is a benchmark of multiple-choice questions written by domain experts in three science domains. The paper’s main set contains 448 questions, drawn from 546 collected in total, and the benchmark is released in three subsets of increasing quality and difficulty:

| Subset | Questions | Description |
| --- | --- | --- |
| GPQA Extended | 546 | All collected questions, including lower-quality ones |
| GPQA Main | 448 | Filtered for quality and difficulty |
| GPQA Diamond | 198 | Hardest subset: questions that both expert validators answered correctly and that most non-expert validators answered incorrectly |

Why “Diamond”?

The Diamond subset applies the strictest quality filter: a question is included only if both independent expert validators (domain experts other than the question writer) answered it correctly, while the majority of skilled non-expert validators answered it incorrectly (a minimal sketch of this filter follows the list below). This double validation ensures every question is:

  1. Unambiguously correct — verified by two independent experts
  2. Genuinely difficult — not solvable through surface-level reasoning or web search
  3. High signal — provides maximum information about model capabilities
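
As a rough illustration only — this is not code from the GPQA release, and the record fields are hypothetical — the Diamond filter can be written as a simple predicate over per-question validation results:

def is_diamond(validation):
    """Hypothetical Diamond filter over one question's validation results.

    `validation` is an assumed record with two lists of booleans:
      - "expert_correct": whether each of the two expert validators answered correctly
      - "non_expert_correct": whether each non-expert validator answered correctly
    """
    experts = validation["expert_correct"]
    non_experts = validation["non_expert_correct"]
    both_experts_correct = all(experts)
    majority_non_experts_wrong = sum(non_experts) < len(non_experts) / 2
    return both_experts_correct and majority_non_experts_wrong

# Example: both experts correct, two of three non-experts wrong -> kept in Diamond
print(is_diamond({"expert_correct": [True, True], "non_expert_correct": [False, True, False]}))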

Key Characteristics

| Feature | Details |
| --- | --- |
| Total questions | 198 (Diamond subset) |
| Domains | Biology, Physics, Chemistry |
| Question type | Multiple-choice (4 options) |
| Expert accuracy | 65% (PhD-level domain experts) |
| Non-expert accuracy | 34% (with 30+ minutes and full web access) |
| Original GPT-4 baseline | 39% (November 2023) |
| License | CC-BY-4.0 |

What Makes It “Google-Proof”?

graph TD
    Q["PhD-level science<br/>question posed"] --> E["Domain Expert<br/>(PhD holder)<br/>65% accuracy"]
    Q --> N["Non-Expert Validator<br/>(30+ min, full web)<br/>34% accuracy"]
    Q --> M["GPT-4<br/>(Nov 2023 baseline)<br/>39% accuracy"]

    E --> V{"Both experts<br/>agree on answer?"}
    V -->|Yes| D["Included in<br/>GPQA Diamond"]
    V -->|No| X["Excluded from<br/>Diamond subset"]

    style Q fill:#8e44ad,color:#fff,stroke:#333
    style E fill:#27ae60,color:#fff,stroke:#333
    style N fill:#e74c3c,color:#fff,stroke:#333
    style M fill:#f39c12,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style X fill:#95a5a6,color:#fff,stroke:#333

The term “Google-proof” means that non-expert validators — intelligent individuals without domain-specific PhD training — cannot solve these questions even with unlimited internet access. The questions require deep conceptual understanding, multi-step reasoning, and expert-level domain knowledge that cannot be pieced together from search results alone.

Who Built It?

GPQA was developed at New York University (NYU) by:

  • David Rein — Lead author
  • Betty Li Hou, Asa Cooper Stickland, Jackson Petty — Core researchers
  • Richard Yuanzhe Pang, Julien Dirani, Julian Michael — Contributing researchers
  • Samuel R. Bowman — Senior advisor (NYU)

Publication

GPQA was published at the First Conference on Language Modeling (COLM 2024), one of the premier venues for language model research.

| Resource | Link |
| --- | --- |
| arXiv paper | arxiv.org/abs/2311.12022 |
| GitHub repository | github.com/idavidrein/gpqa |
| Hugging Face dataset | huggingface.co/datasets/Idavidrein/gpqa |
| Conference | COLM 2024 (First Conference on Language Modeling) |

What Skills Does It Test?

GPQA Diamond tests deep expert-level scientific reasoning — not surface-level knowledge retrieval.

graph TD
    GPQA["GPQA Diamond<br/>198 questions"] --> P["Physics<br/>Quantum mechanics,<br/>thermodynamics,<br/>relativity"]
    GPQA --> C["Chemistry<br/>Organic reactions,<br/>spectroscopy,<br/>molecular structure"]
    GPQA --> B["Biology<br/>Molecular biology,<br/>genetics,<br/>biochemistry"]

    style GPQA fill:#e74c3c,color:#fff,stroke:#333
    style P fill:#3498db,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style B fill:#f39c12,color:#fff,stroke:#333

| Capability | What GPQA Diamond Tests |
| --- | --- |
| Graduate-level knowledge | Questions require PhD-level understanding in specific subfields |
| Multi-step reasoning | Most questions demand chaining multiple concepts together |
| Resistance to search | Answers cannot be found via web search — they require deep understanding |
| Cross-domain synthesis | Some questions span subdisciplines within a field |
| Calibration | Whether models can accurately assess their own confidence |

Example Difficulty

A typical GPQA Diamond question might ask about the outcome of a specific quantum mechanical calculation, the product of a multi-step organic synthesis, or the implications of a particular genetic regulatory mechanism — requiring graduate-level coursework and research experience to answer correctly.

Current Leaderboard

The table below compiles GPQA Diamond accuracy scores from official model announcements and technical reports. All scores are pass@1 (single attempt) unless otherwise noted.

Sources: OpenAI model announcements (o1 blog, o3-mini blog), Google DeepMind (Gemini 2.5 blog), Anthropic model cards, original GPQA paper. Consulted July 2025.

| Rank | Model | Accuracy (%) | Source |
| --- | --- | --- | --- |
| – | Human domain experts (PhDs) | 65.0 | GPQA paper |
| – | Non-expert validators (30+ min, web) | 34.0 | GPQA paper |
| 1 | o1 (OpenAI) | 77.3 | OpenAI o1 blog |
| 2 | o3-mini (high) (OpenAI) | 77.0 | OpenAI o3-mini blog |
| 3 | o1-preview (OpenAI) | 73.3 | OpenAI o1 blog |
| 4 | GPT-4o (OpenAI) | 50.6 | OpenAI o1 blog |
| 5 | GPT-4 (OpenAI, 2023 baseline) | 39.0 | GPQA paper |

Key takeaways:

  • o1 was the first AI model to surpass human PhD experts on GPQA Diamond (77.3% vs. 65%), a milestone highlighted by OpenAI
  • o3-mini (high) matches o1 performance at significantly lower cost
  • The gap between non-experts with web access (34%) and experts (65%) confirms questions are genuinely “Google-proof”
  • Even GPT-4o (50.6%) falls short of PhD expert performance, despite being far more capable than the original GPT-4 baseline

Note: More recent models — including o3, Gemini 2.5 Pro, Claude 3.7 Sonnet (extended thinking), and DeepSeek-R1 — have also been evaluated on GPQA Diamond. Google reports Gemini 2.5 Pro as “state-of-the-art” on GPQA. For the latest results, consult the resources listed below.

Where to Explore the Benchmark

Dataset and Code

| Resource | Description | Link |
| --- | --- | --- |
| Hugging Face dataset | Full GPQA dataset (Main, Extended, Diamond splits) | huggingface.co/datasets/Idavidrein/gpqa |
| GitHub repository | Evaluation code, baselines, and documentation | github.com/idavidrein/gpqa |
| arXiv paper | Full technical paper with methodology and analysis | arxiv.org/abs/2311.12022 |

Load the Dataset

from datasets import load_dataset

# GPQA is gated on Hugging Face to limit training-data contamination, so you may
# need to accept the dataset terms and authenticate first (e.g. `huggingface-cli login`).
dataset = load_dataset("Idavidrein/gpqa", "gpqa_diamond")
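
Each record pairs a question with one correct and three incorrect answers. The sketch below shows one way to turn a record from the dataset loaded above into a shuffled four-option prompt; the field names ("Question", "Correct Answer", "Incorrect Answer 1–3") and the "train" split follow the Hugging Face dataset card, so double-check them against the version you load.

import random

def format_question(record, rng):
    """Build a shuffled four-option multiple-choice prompt from one GPQA record."""
    options = [
        record["Correct Answer"],
        record["Incorrect Answer 1"],
        record["Incorrect Answer 2"],
        record["Incorrect Answer 3"],
    ]
    rng.shuffle(options)  # randomize option order so position carries no signal
    gold_letter = "ABCD"[options.index(record["Correct Answer"])]
    body = "\n".join(f"({letter}) {text}" for letter, text in zip("ABCD", options))
    return f"{record['Question']}\n\n{body}", gold_letter

rng = random.Random(42)
prompt, gold_letter = format_question(dataset["train"][0], rng)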

Understanding the Metrics

Pass@1 Accuracy

The primary metric. Each question is a 4-option multiple-choice problem. The model produces a single answer, and accuracy is the fraction of correct responses. Random baseline is 25%.
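
As a minimal sketch (the `answer_fn` callable is a placeholder for whatever produces the model’s letter choice, not part of any official harness), pass@1 is just the fraction of questions answered correctly on a single attempt:

def pass_at_1(examples, answer_fn):
    """Accuracy over one attempt per question.

    `examples` is an assumed list of dicts with "prompt" and "gold" (letter) keys;
    `answer_fn(prompt)` returns the model's single letter choice, e.g. "B".
    """
    correct = sum(answer_fn(ex["prompt"]) == ex["gold"] for ex in examples)
    return correct / len(examples)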

Consensus@64

Some evaluations (notably OpenAI’s) also report consensus@64: the model generates 64 responses per question, and the final answer is chosen by majority vote. This measures how much accuracy improves when the model’s answers are aggregated across many samples (a minimal sketch of the aggregation follows the table below).

| Model | Pass@1 | Consensus@64 |
| --- | --- | --- |
| GPT-4o | 50.6% | 56.1% |
| o1-preview | 73.3% | 78.3% |
| o1 | 77.3% | 78.0% |

Key insight: The small gap between pass@1 and consensus@64 for o1 (77.3% vs. 78.0%) suggests the model’s answers are highly consistent — it either knows or doesn’t know, with little variance.
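
A minimal sketch of the majority-vote aggregation, assuming a `sample_answer(prompt)` callable that returns one letter per call (a real evaluation would batch these calls and control sampling temperature):

from collections import Counter

def consensus_answer(prompt, sample_answer, k=64):
    """Sample k answers for one prompt and return the most common letter."""
    votes = Counter(sample_answer(prompt) for _ in range(k))
    return votes.most_common(1)[0][0]

def consensus_at_k(examples, sample_answer, k=64):
    """Accuracy when each question is decided by majority vote over k samples."""
    correct = sum(
        consensus_answer(ex["prompt"], sample_answer, k) == ex["gold"] for ex in examples
    )
    return correct / len(examples)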

Why GPQA Diamond Matters

graph LR
    A["Expert-level<br/>difficulty"] --> C["GPQA Diamond<br/>as a yardstick"]
    B["Google-proof<br/>questions"] --> C
    C --> D["Measures genuine<br/>scientific reasoning"]
    C --> E["Human-AI<br/>comparison point"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style B fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style E fill:#3498db,color:#fff,stroke:#333

  1. First benchmark where AI surpassed PhD experts — o1’s 77.3% vs. experts’ 65% was a landmark moment for AI capabilities
  2. Measures deep reasoning, not retrieval — “Google-proof” design ensures models must actually understand the science
  3. Standard evaluation for frontier models — reported by every major AI lab in model release announcements
  4. Clear human baseline — the 65% expert ceiling provides a meaningful reference point
  5. Focused on STEM — targets the science domains most relevant to AI safety and capability concerns


Conclusion

GPQA Diamond stands as one of the most important benchmarks in AI evaluation:

  • 198 rigorously vetted questions in biology, physics, and chemistry — double-validated by independent PhD experts
  • PhD-level domain experts score only 65% — and non-experts with full web access score just 34%
  • The first benchmark where AI surpassed human experts — OpenAI’s o1 reached 77.3%, crossing the 65% expert threshold
  • Built at NYU and published at COLM 2024, establishing it as a peer-reviewed standard
  • Remains a key differentiator for frontier models — reported in every major model release

As reasoning-focused models continue to improve, GPQA Diamond provides a critical measure of whether AI systems possess genuine scientific understanding — not just the ability to pattern-match answers from training data.
